ESTIMATING THE DISTRIBUTION OF TREATMENT EFFECTS
Abstract. In this paper we show that the distribution of treatment effects is point identified in a
model where the outcome equation is of unrestricted form and the selection equation contains more than
one unobservable. This is different and economically better motivated than the treatment effect on the
distribution, usually the quantiles, which is commonly analyzed in the literature. Our key identifying
assumptions are a linear random coefficients structure in the selection equation and continuously
distributed instruments. This allows point identification of the entire distribution of treatment
effects under conditions on unobserved heterogeneity that, unlike the case of additively
separable/monotonic scalar unobservables, have a clear economic interpretation in terms of unobserved
heterogeneity. Also, we obtain results on the distribution of treatment effects without invoking any
scalar monotonicity assumption in the outcome equation. Moreover, the identification is constructive
and suggests estimators of various quantities of interest by sample counterparts.






1. Introduction

Motivation. In this paper we consider estimating the distribution of treatment effects in the
following structural Roy type treatment effects model

(1.1)  $Y = A + BX$

(1.2)  $X = 1\{Z'\Gamma > 0\}$

(1.3)  $Z$ is independent of $(A, B, \Gamma')$.

In the outcome equation (1.1), $A$ and $B$ are scalar random coefficients and $1\{\cdot\}$ denotes the
indicator function. In the treatment effect literature $A$ is usually denoted as $Y_0$ and
$B = Y_1 - Y_0$, i.e., $A$ denotes the outcome in the control group, and $B$ the treatment effect. In
the binary treatment case, the linearity in the outcome equation is unrestrictive.

The object of interest in this paper is the distribution of treatment effects $f_B$ in the
case of endogenous selection into treatment, and moments of the distribution like the mean (average)
and the variance of treatment effects. We model the binary choice to participate in the treatment in
equation (1.2). The participation decision involves $Z$, an $L$-vector of instruments, and $\Gamma$, a
random vector that accounts for heterogeneous preferences and information which in turn governs the
selection into treatment. We allow for the first component of $Z$ to be unity and for the first
component of $\Gamma$ to absorb the usual stochastic shock term.
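
To make the model concrete, the following Python sketch simulates a toy version of (1.1)-(1.3) with a
scalar instrument. All distributional choices (Gaussian draws, the loadings of A and B on the first
component of Gamma, the function name `simulate_roy`) are assumptions made purely for illustration and
are not taken from the paper. It shows that, under endogenous selection, the naive treated-minus-control
contrast need not equal the average of B.

```python
import numpy as np

def simulate_roy(n, seed=0):
    """Draw (Y, X, Z2) and the latent (A, B) from a toy version of (1.1)-(1.3)."""
    rng = np.random.default_rng(seed)
    z2 = rng.normal(size=n)                  # continuous instrument; Z = (1, Z2)
    g1 = rng.normal(size=n)                  # Gamma_1: absorbs the usual shock term
    g2 = 0.5 + np.abs(rng.normal(size=n))    # Gamma_2: random slope, kept positive
    a = 0.5 * g1 + rng.normal(size=n)        # A = Y0, correlated with Gamma (assumed design)
    b = 1.0 + g1 + rng.normal(size=n)        # B = Y1 - Y0, correlated with Gamma (assumed design)
    x = (g1 + g2 * z2 > 0).astype(float)     # (1.2): X = 1{Z'Gamma > 0}
    y = a + b * x                            # (1.1): Y = A + BX
    return y, x, z2, a, b

y, x, z2, a, b = simulate_roy(20_000)
naive = y[x == 1].mean() - y[x == 0].mean()  # treated-minus-control contrast
ate = b.mean()                               # infeasible in practice: B is latent
```

In this assumed design both A and B load positively on Gamma_1, which also drives selection, so the
naive contrast overstates the average effect. With real data B is never observed, so `ate` could not
be computed this way.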

As is evident from the equations (1.1)-(1.3), we do place some structure on the endogenous 
selection into treatment. First, we assume that we have instruments that are fully independent from 
the unobservables in the system, and, as we shall see below, also continuously distributed. The key 
identifying restriction is the linear random coefficients structure in the selection equation. While 
the linearity is clearly restrictive, it allows us to model several unobservables in a way that is more 
reminiscent of structural economics. Note that even if interest centers only on the average treatment 
effects, introducing some type of structure is necessary as there is not even point identification of the 
average of the distribution of treatment effects (ATE) in general.^1

Since the aim of this paper is to propose a minimal structure that point identifies the mean,
the variance, as well as the entire distribution of treatment effects, we have to restrict the
structure of the model at some point. We feel that our linear random coefficients structure is well
motivated by heterogeneity in a population of economic agents: it corresponds to the notion that
on the individual level linearity of the selection equation is (at least approximately) a valid
description of behavior. Depending on the feature of the distribution, we will require further
assumptions. If the focus is on the mean of the distribution, we do not require any additional
assumptions (apart from regularity conditions). For the variance of treatment effects, we provide sharp
bounds, and show point identification under a covariance restriction. Finally, we establish
identification of the distribution of treatment effects under the assumption that $A \perp B \mid \Gamma$.
We will discuss this assumption in detail below. At this point we would like to point out that this
assumption is satisfied if there exist otherwise unrestricted mappings $\psi_A, \psi_B$ such that
$A = \psi_A(\Gamma, U_A)$ and $B = \psi_B(\Gamma, U_B)$, with $U_A$, $U_B$ possibly infinite
dimensional, such that $U_A \perp U_B \perp \Gamma$. This corresponds to a notion that the
selection equation reveals information about the common endogenous factors; there is (potentially
complicated) remaining heterogeneity in $A$ and $B$, but it is independent of everything else.



^1 Indeed, Shaikh and Vytlacil (2009) provide sharp bounds for this effect which characterize the
identified set, and Imbens and Angrist (1994) show that even under a monotonicity assumption on
instruments, only the average effect for "compliers" (local average treatment effect, LATE), a specific
subpopulation defined by instruments, is point identified.

The main results in this paper establish that under the respective conditions, for $B = Y_1 - Y_0$,

$$E[B \mid \Gamma = \gamma], \quad \mathrm{Var}[Y_1 - Y_0 \mid \Gamma = \gamma], \quad \text{and} \quad f_{B|\Gamma}(b;\gamma)$$

are (point) identified. These objects correspond to generalizations of the Heckman and Vytlacil (2005)
MTE to several sources of heterogeneity. From these results, we show that the unconditional average,
variance and distribution of treatment effects may be identified, but this requires an identification
at infinity type of assumption. However, even in the absence of identification at infinity, these
results can also be employed as building blocks to obtain policy relevant treatment effects, as in
Carneiro, Heckman and Vytlacil (2010). Finally, and equally importantly, we establish that in the
general case with limited support for our instruments, we can only identify the average, variance or
distribution of effects for a subpopulation defined by the range of the instrument, a population that
is related to the one considered in Angrist, Graddy and Imbens (2000).

Based on these identification results, we provide sample counterpart estimators. It is known
from Hoderlein, Klemelä and Mammen (2010) and Gautier and Kitamura (2009) that the estimation
of the distribution of random coefficients in the exogenous single equation case is an ill-posed
inverse problem. It is clear from our identification argument that the estimation of the distribution
of the treatment effect $B$ is a different ill-posed inverse problem, akin to conditional deconvolution
with noise and mixing distributions that are unknown but can themselves be estimated by solving inverse
problems. We provide such estimators, and analyze their large sample behavior.

Literature. Naturally, this paper touches upon two related sets of literatures; the first is the
treatment effect literature, in particular the part that is related to distributional treatment effects, the 
second is the random coefficients literature. Key references for the former are the quantile treatment 
effect of Abadie, Angrist and Imbens (2002), Chernozhukov and Hansen (2005) and Heckman, Smith 
and Clements (1996). Note that the first two results essentially require a rank invariance assumption, 
i.e., the individuals retain their ordering both in the treatment and the control group, an assumption 
which may only be slightly weakened. This assumption is restrictive, and has rightfully been criticized, 
see Heckman, Smith and Clements (1996), who point out that what is of interest is the distribution 
of treatment effects, and not the treatment effect on the distribution, and who provide (Fréchet)
bounds for this quantity. In contrast, we provide point identification of the distribution of effects
under different assumptions. As an implication, we can also obtain results for the average treatment
effect, which is the expected value of the distribution of $B$. This is related to the seminal
contributions on LATE (Imbens and Angrist (1994)) and MTE (Heckman and Vytlacil (1999, 2005, 2007),
henceforth HV).

The second line of work, which is popular for modeling unobserved heterogeneity, is, as
mentioned, that on random coefficient models. Random coefficient models allow the preference or
production parameters to vary across the population. In this paper, we allow different individuals to
have different preferences for treatment, and the effect of treatment to differ across individuals. We
emphasize the nonparametric aspect of our analysis, which allows us to be flexible about the form of
unobserved heterogeneity. References in econometrics include Elbers and Ridder (1982), Heckman
and Singer (1984), Beran and Hall (1992), Ichimura and Thompson (1998), Fox and Gandhi (2009), 
Hoderlein, Klemela and Mammen (2010), Gautier and Kitamura (2009) and Gautier and Le Pennec 
(2011). The last two references recognize that the estimation of the density of the latent random 
coefficients vector is a statistical inverse problem. The literature on the treatment of these problems 
is extensive in statistics and econometrics (see, e.g., Carrasco, Florens and Renault (2007) for a survey 
of applications in economics). Fox and Gandhi (2009) are the first to study the identification of the 
distribution of unobserved heterogeneity in treatment effects models, however they do not allow for 
an intercept in the binary choice model. 

The two equations model in this paper combines the random coefficients linear model studied 
by Beran and Hall (1992), Hoderlein, Klemela and Mammen (2010) and the random coefficients binary 
choice model studied by Ichimura and Thompson (1998), Gautier and Kitamura (2009), and Gautier 
and Le Pennec (2011). The first equation though has a regressor which is a dummy variable whose 
effect varies by a random coefficient. This is not handled in Beran and Hall (1992) or Hoderlein, 
Klemela and Mammen (2010). 

The problem is doubly related to inverse problems: the conditional densities of $A + B$ and
$A$ given $\Gamma$, as well as the density of $\Gamma$, are obtained by solving inverse problems. We
present two approaches when L > 3. The one presented in Section 3 involves the inversion of the Radon
transform (see, e.g., Helgason (1999), Korostelev and Tsybakov (1993), Cavalier (2000) and Hoderlein,
Klemelä and Mammen (2010) for statistical inverse problems involving the Radon transform on
$\mathbb{R}^d$) applied to the derivative of a regression function. The one of Section 5.1 relies on
the inversion of the Hemispherical transform (see, e.g., Funk (1916), Groemer (1996), Rubin (1999),
and Gautier and Kitamura (2009) and Gautier and Le Pennec (2011) for a statistical inverse problem
involving the Hemispherical transform). A second inverse problem appears when we obtain the conditional
density of $B$ given $\Gamma$ assuming that $A \perp B \mid \Gamma$. This corresponds to conditional
deconvolution. Evdokimov (2010) considers conditional deconvolution in a different problem. In the
classical deconvolution problem, the density of $A$ is known and the characteristic function of
$A + B$ is estimated via the empirical characteristic function, which estimates the true characteristic
function at rate $1/\sqrt{N}$. An extension studied in Diggle and Hall (1993), Neumann (1997), Comte and
Lacour (2009) and Johannes (2009) considers the case where the density of $A$ is estimable using a
preliminary sample. In this paper $A + B$ and $A$ are unobserved and both conditional characteristic
functions given $\Gamma$ are estimated by solving inverse problems using the same sample.

Throughout this paper, we will analyze the distribution of random coefficients under the
assumption that we observe an independent and identically distributed sample
$(y_i, x_i, z_i')_{i=1,\dots,N}$, where $N$ is the sample size, and assume that the independent and
identically distributed realizations $(a_i, b_i, \gamma_i')_{i=1,\dots,N}$ are unobserved. Because it
allows us to relate our results closely to the literature, in Section 2 we consider in detail the
important case of a single continuous instrument. In Section 3 we present the more general case in
which there is more than one continuous instrument and we keep the index structure of the model defined
by equations (1.1)-(1.3). In the fourth section, we analyze the large sample behavior of our estimator.

2. The Single Unobservable Case 

In this section, we start out with the standard specification for the selection equation in the 
treatment effects literature. We review the main result of Heckman and Vytlacil (2005, henceforth 
HV) in this setup, and show how to obtain the variance or the distribution of the treatment effect 
including a discussion of the respective assumptions. 

2.1. MTE and ATE with a Single Unobservable in the Selection Equation. 

2.1.1. Model and Assumptions. The setup we shall consider can be formalized as follows in a random
coefficients setup:

(2.1)  $Y = A + BX$,  $X = 1\{P > V\}$,  $V \mid Z \sim U(0,1)$,

where $X$ is a binary treatment indicator and $P = \pi(Z) = \Pr(X = 1 \mid Z)$ is the selection
probability (sometimes also referred to as "propensity score"). This model could arise in several ways,
and it is neither more nor less general than the pure random coefficients model defined by equations
(1.1)-(1.3). However, when we have one instrument the latter is nested in the former. To see this,
recall that the selection equation (1.2) is then defined as



(2.2)  $X = 1\{\Gamma_1 + \Gamma_2 Z_2 > 0\}$.

For identification of the density of $\Gamma/\|\Gamma\|$, we know from Gautier and Kitamura (2009) that
it is sufficient that the support be included in some half sphere. This occurs if the coefficient of
$Z_2$ has a fixed sign. Replacing $Z_2$ by $-Z_2$ if necessary, we can assume that the coefficient is
nonnegative. Assuming that it is also nonzero, we can write the event inside the indicator of equation
(2.2) as $Z_2 > -\Gamma_1/\Gamma_2$. Moreover, this is observationally equivalent to

(2.3)  $X = 1\{F_{-\Gamma_1/\Gamma_2}(Z_2) > V\}$,

where $F_{-\Gamma_1/\Gamma_2}$ is the cdf of $-\Gamma_1/\Gamma_2$, with $V \sim U(0,1)$. This obviously
leads to model (2.1) with $\pi(Z_2) = F_{-\Gamma_1/\Gamma_2}(Z_2)$. However, in the case of $d$
instruments, the two models are generally different, hence we treat the two approaches separately.
Observe that at this stage, we impose additive separability inside the indicator of the first stage
(i.e., the selection) equation, which is obviously restrictive in terms of the unobservables that are
allowed for, but at least allows for an unrestricted function $\pi$ of the instruments. We start out by
treating this model for two reasons. First, we want to provide a useful addition to the standard HV
type of framework. Secondly, we want to understand similarities and differences with the case when we
have more than one unobservable^2 in the selection equation.

^2 With the scaling of (2.3), there is only one unobservable, which is $V$, though the size parameter
$L$ introduced later, which corresponds to vectors in the coordinate system before rescaling, is 2.

The formulation is somewhat different from the standard treatment effects formulation. To
see the parallels, set $A = Y_0$ and $B = Y_1 - Y_0$. With this notation,
$Y = A + BX = Y_0 + (Y_1 - Y_0)X$, which is more standard. It is also useful to think of $Y$ as being
generated by a nonseparable model; in this case $Y = \psi(X, U)$, and $Y_0 = \psi(0, U) = A$, as well
as $Y_1 = \psi(1, U) = A + B$. As is obvious from these equalities, in the binary case the linearity in
$X$ in the outcome equation is without loss of generality. However, if we identify $U$ with a high
dimensional unobservable, e.g., preferences, it is interesting to note that we are thinking of our
random coefficients $A$ and $B$ as two different functions of the unobservables, i.e.,
$A = a(U) = \psi(0,U)$ and $B = b(U) = \psi(1,U) - \psi(0,U)$. Without loss of generality, one could
further partition the set of unobservables into vectors $U_0$, $U_1$, and $U_2$. Here, $U_2$ are
unobservables that are common to both the treatment and the control group^3, $U_0$ are specific to the
control group but do not affect the effect of treatment, and $U_1$ impacts only the effect of
treatment. This means that $A = a(U_0, U_2)$ and $B = b(U_1, U_2)$.

^3 We prefer the term subpopulation, but use this common name in the literature.

We will analyze the model under materially the same assumptions as HV, but add one crucial
condition to deal with distributions. The effect we analyze is the marginal treatment effect (MTE),
which is, for $p \in (0,1)$:

$$E[B \mid V = p] = E[Y_1 - Y_0 \mid V = p],$$

which is the average effect of treatment for the subpopulation for which $V$, the first stage
preference parameter, takes on the value $p$. For $P = p$, we can think of this population as being
indifferent between participation and non-participation in the treatment, see HV (2005, 2007) for a
more extensive discussion of this parameter.

The assumptions we employ to first state the HV result, and then proceed to the distribution 
of effects, are as follows: 

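Assumption 2.1. Let $(\Omega, \mathcal{F}, P)$ be a complete probability space on which are defined the
random vectors $(Y, X, Z, A, B, V) : \Omega \to \mathcal{Y} \times \mathcal{X} \times \mathcal{Z} \times
\mathcal{A} \times \mathcal{B} \times \mathcal{V}$, with $\mathcal{Y} \subset \mathbb{R}$,
$\mathcal{X} = \{0,1\}$, $\mathcal{Z} \subset \mathbb{R}^d$, $\mathcal{A} \subset \mathbb{R}$,
$\mathcal{B} \subset \mathbb{R}$ and $\mathcal{V} \subset \mathbb{R}$. The causal model is defined by
equation (2.1), where $\pi : \mathcal{Z} \to [0,1]$ is a Borel measurable function, and realizations of
$(Y,X,Z)$ are observable whereas those of $(A,B,V)$ are not.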

Assumption 2.2. All of the defined probability distributions (joint, marginal, and conditional)
involving $(Y,X,Z,A,B,V)$, but with $X$ only appearing in the conditioning set, are absolutely
continuous with respect to Lebesgue measure.

Assumption 2.3. (i) $(A,B,V)$ are independent of $Z$. (ii) The distribution of $A = Y_0$ and the
distribution of $B = Y_1 - Y_0$ given $V = p$ have moments of order one and
$E[B \mid V = \cdot] \in L^1(\mathbb{R})$. (iii) The conditional density of $B = Y_1 - Y_0$ given
$V = p$ is in $L^1(\mathbb{R}) \cap L^2(\mathbb{R})$, and $E[e^{itA} \mid V = \cdot]$ and
$E[e^{it(A+B)} \mid V = \cdot]$, where $i = \sqrt{-1}$, are in $L^1(\mathbb{R})$. (iv) The conditional
distributions of $A = Y_0$ and $B = Y_1 - Y_0$ given $V = p$ have second moments and
$E[A^2 \mid V = \cdot]$, $E[B^2 \mid V = \cdot]$ and $E[AB \mid V = \cdot]$ are in $L^1(\mathbb{R})$.

As will be clear from the proofs, the model and the respective assumptions in Assumption 2.3
imply that the conditional expectations $E[Y \mid \pi(Z) = p]$, $E[Xe^{itY} \mid \pi(Z) = p]$ and
$E[(1 - X)e^{itY} \mid \pi(Z) = p]$ exist and are differentiable. This does not need to be assumed.

2.1.2. Main Result. These assumptions allow us to characterize the MTE.

Theorem 2.1. Suppose that Assumptions 2.1, 2.2 (i), and 2.3 (i) and (ii) hold in the model defined
through equation (2.1). Then,

$$E[B \mid V = p] = \partial_p E[Y \mid \pi(Z) = p]$$

holds.

Proof. See Heckman and Vytlacil (2005).

For an in-depth discussion, see HV (2005, 2007). For our purpose, note that the MTE on the left-hand
side is identified by the local instrumental variable (LIV) on the right-hand side. The only thing we
want to point out is that, in order to obtain the average treatment effect (ATE), we would need to
integrate over $p$ from 0 to 1, or

$$E[B] = \int_0^1 E[B \mid V = p]\,dp = \int_0^1 \partial_p E[Y \mid \pi(Z) = p]\,dp,$$

which requires "identification at infinity" in the sense that the instrument has to be informative
enough to shift $\pi(z)$ to both zero and unity.
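
The local instrumental variable idea can be illustrated numerically. The Python sketch below assumes
that the propensity score P is known (in practice it has to be estimated), simulates model (2.1) with a
design chosen so that E[B | V = p] = 1 + 2p, and estimates the derivative of E[Y | P = p] by a
kernel-weighted local quadratic fit. The helper `local_quadratic_slope`, the Epanechnikov kernel, the
bandwidth 0.3 and the simulated design are illustrative assumptions, not the paper's choices.

```python
import numpy as np

def local_quadratic_slope(x, y, x0, h):
    """Slope at x0 of an Epanechnikov-weighted quadratic fit of y on x."""
    u = (x - x0) / h
    keep = np.abs(u) < 1.0                       # kernel has support (-1, 1)
    w = 0.75 * (1.0 - u[keep] ** 2)              # Epanechnikov weights
    d = x[keep] - x0
    design = np.column_stack([np.ones_like(d), d, d ** 2])
    sw = np.sqrt(w)                              # weighted least squares via rescaling
    coef, *_ = np.linalg.lstsq(design * sw[:, None], y[keep] * sw, rcond=None)
    return coef[1]                               # coefficient on (x - x0) = derivative at x0

rng = np.random.default_rng(1)
n = 50_000
p = rng.uniform(size=n)                          # propensity score P = pi(Z), assumed known
v = rng.uniform(size=n)                          # V | Z ~ U(0, 1)
a = v + rng.normal(size=n)                       # A depends on V: endogenous selection
b = 1.0 + 2.0 * v + 0.5 * rng.normal(size=n)     # E[B | V = v] = 1 + 2v (assumed design)
x = (p > v).astype(float)                        # X = 1{P > V}
y = a + b * x

mte_hat = local_quadratic_slope(p, y, 0.5, 0.3)  # LIV estimate of the MTE at p = 0.5
mte_true = 1.0 + 2.0 * 0.5
```

Since E[Y | P = p] is exactly quadratic in p under this design, the local quadratic fit has no
smoothing bias here; with other designs the bandwidth would trade off bias and variance.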

2.2. The Distribution of Treatment Effects with a Single Unobservable in the Selection Equation.

2.2.1. Parameter of Interest. Having laid out the setup and related it to previous work, we now
introduce the object of interest. In exactly the same setup as defined above in equation (2.1), we will
be concerned with the distribution of treatment effects for the same subpopulation as the one
considered by HV. More formally, we are interested in recovering

$$f_{B|V}(b;p) = f_{Y_1 - Y_0|V}(b;p)$$

for any $b \in \mathrm{supp}(Y_1 - Y_0)$, where $\mathrm{supp}(Q)$ denotes the support of a random
variable $Q$. In analogy to HV, we call this the "Distribution of Treatment Effects at the Margin", and
abbreviate it DITEM. The interpretation is also quite similar to HV: for the subpopulation that is
indifferent between participation and non-participation, it provides us with a measure of the effect
of treatment. However, this measure is now the distribution of effects. As mentioned in the
introduction, the distribution of effects is different from the effect of treatment on the
distribution.

We will study this object under an additional identifying assumption.

Assumption 2.4. $A \perp B \mid V$.

Under our maintained set of assumptions, this assumption is sufficient for point identification
of $f_{B|V}(b;p)$. This assumption restricts the heterogeneity appearing in this model. It is best
understood in terms of the reformulation introduced above, namely $A = a(U_0, U_2)$ and
$B = b(U_1, U_2)$. A sufficient condition for Assumption 2.4 is that $V = U_2$ and
$U_0 \perp U_1 \mid V$ (for the latter it would in turn be sufficient that
$U_0 \perp U_1 \perp V$). In words, there is a common driving factor that causes endogeneity
in the selection model, and it is given by $V$, which, even though it is not recovered, may serve as a
control function. There is remaining randomness in $A$ and $B$; however, once the driving factor for
endogeneity in this system, i.e., $V$, is accounted for, there is no leftover endogeneity.

Note that it does not mean that $A \perp B$. In fact, unless there is no endogenous selection there
will generally be dependence between $Y_0$ and $Y_1 - Y_0$. In other words, there is endogenous
selection into treatment, but as far as it is endogenous, it can be summarized by $V$.

This assumption will be weakened in the model with several unobservables in the selection equation, in
the sense that there is not just one factor that we can employ to control for endogeneity, but we have
a full vector of such variables, meaning that if we have a richer set of unobservables
$V_1, \dots, V_k$, a condition like $A \perp B \mid V_1, \dots, V_k$ is more likely to hold.

The proof of this result relies crucially on conditional characteristic functions (ccf's) of the
scalar random unobservables $H$ ($A$, $A + B$ and $B$), conditional on $V = p$, defined as

$$E[e^{itH} \mid V = p].$$

To recover the ccf of $B$ at $V = p$, we require the following condition:

Assumption 2.5.

$$\forall t \in \mathbb{R},\ \forall p \in (0,1):\quad \partial_p E[(1 - X)e^{itY} \mid P = p] \neq 0.$$

As we shall see in the proof of Theorem 2.2, the quantity in Assumption 2.5 is, up to sign,
$E[e^{itA} \mid V = p]$, the ccf of the distribution of the untreated subpopulation conditional on the
unobservables in the selection equation (the source of endogeneity). This assumption is technical and
it is classical in deconvolution problems where $A$ is the noise corrupting the signal $B$.
Characteristic functions of most standard distributions (normal, log-normal, Cauchy, Laplace,
$\chi^2$, Student-t, etc.) do not vanish.

2.2.2. Distribution of Treatment Effects. These assumptions allow us to characterize the DITEM:

Theorem 2.2. Suppose that Assumptions 2.1, 2.2, 2.3 (i) and (iii)^4, 2.4 and 2.5 hold in the model
defined through equation (2.1). Then,

$$E[e^{itB} \mid V = p] = \frac{\partial_p E[Xe^{itY} \mid P = p]}{-\partial_p E[(1 - X)e^{itY} \mid P = p]}$$

and

$$f_{B|V}(b;p) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itb}\,
\frac{\partial_p E[Xe^{itY} \mid P = p]}{-\partial_p E[(1 - X)e^{itY} \mid P = p]}\,dt$$

hold.

^4 The square integrability is indeed only useful for the analysis of the rates of convergence of the
estimator.
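
The mechanics behind the ratio of derivatives can be sketched as follows, under model (2.1), the
independence in Assumption 2.3 (i) and Assumption 2.4. The shorthand $m_1$, $m_0$ is introduced here
only for illustration, and regularity details (existence of the derivatives, interchange of limits) are
left aside.

```latex
\begin{align*}
m_1(t,p) &:= E[Xe^{itY} \mid P = p] = E\big[1\{V < p\}\,e^{it(A+B)}\big]
          = \int_0^p E[e^{it(A+B)} \mid V = v]\,dv, \\
m_0(t,p) &:= E[(1-X)e^{itY} \mid P = p] = E\big[1\{V \ge p\}\,e^{itA}\big]
          = \int_p^1 E[e^{itA} \mid V = v]\,dv, \\
\partial_p m_1(t,p) &= E[e^{it(A+B)} \mid V = p], \qquad
\partial_p m_0(t,p) = -E[e^{itA} \mid V = p], \\
E[e^{it(A+B)} \mid V = p] &= E[e^{itA} \mid V = p]\;E[e^{itB} \mid V = p]
          \qquad \text{(since } A \perp B \mid V\text{)}, \\
E[e^{itB} \mid V = p] &= \frac{\partial_p m_1(t,p)}{-\partial_p m_0(t,p)}
          \qquad \text{(the denominator is nonzero by Assumption 2.5)}.
\end{align*}
```

The density then follows from the usual Fourier inversion formula applied to the conditional
characteristic function of $B$.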

Remarks: Since $B = Y_1 - Y_0$, this gives the distribution of treatment effects. From there on,
we can get many quantities of interest. If the score varies in the whole range (0, 1), which amounts to
identification at infinity, we can obtain the unconditional characteristic function

$$E[e^{itB}] = \int_0^1 E[e^{itB} \mid V = p]\,dp$$

or the unconditional density of the treatment effect

$$f_B(b) = \int_0^1 f_{B|V}(b;p)\,dp.$$

Alternatively, akin to Carneiro, Heckman and Vytlacil (2010) we can look at how weighted averages
change with $\theta$, i.e., $\partial_\theta \chi(t,\theta)$ or $\partial_\theta C(b,\theta)$ (marginal
policy relevant treatment effect), where $\chi(t,\theta)$ and $C(b,\theta)$ denote $\theta$-dependent
weighted averages over $p$ of $E[e^{itB} \mid V = p]$ and of $f_{B|V}(b;p)$, respectively.

For such quantities it is not necessary that the score varies in the whole range (0, 1).

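2.3. Variance of Treatment Effects. Computations similar to those used in the proof of Theorem
2.2 allow us to obtain various moments of the distribution of treatment effects under milder
assumptions than in Section 2.2. For example, Theorem 2.3 gives the variance. It is obtained by simple
algebra from the identities $\partial_p E[XY^k \mid P = p] = E[(A+B)^k \mid V = p]$ and
$\partial_p E[(1 - X)Y^k \mid P = p] = -E[A^k \mid V = p]$ for $k = 1, 2$.

Theorem 2.3. If Assumptions 2.1, 2.2, 2.3 (i) and (iv) and $E[AB \mid V] = E[A \mid V]E[B \mid V]$
hold, then

$$\mathrm{Var}(Y_1 - Y_0 \mid V = p) = \mathrm{Var}(B \mid V = p) = \gamma(p)$$

where

$$\gamma(p) = \partial_p E[XY^2 \mid P = p]
- \big(\partial_p E[Y \mid P = p] - 2\,\partial_p E[(1 - X)Y \mid P = p]\big)\,\partial_p E[Y \mid P = p]
+ \partial_p E[(1 - X)Y^2 \mid P = p].$$

Strikingly, even if we do not assume $E[AB \mid V] = E[A \mid V]E[B \mid V]$ we can get the following
bound on this quantity.

Theorem 2.4. If Assumptions 2.1, 2.2, 2.3 (i) and (iv) hold, then the following sharp bounds hold

$$\big(\mathrm{Var}(B \mid V = p) - \gamma(p)\big)^2 \le
-4\,\partial_p E[(1 - X)Y^2 \mid P = p]\;\partial_p E[XY^2 \mid P = p]$$

where

$$\gamma(p) = \partial_p E[XY^2 \mid P = p] - \big(\partial_p E[Y \mid P = p]\big)^2
- \partial_p E[(1 - X)Y^2 \mid P = p].$$

Indeed, $\mathrm{Var}(B \mid V = p) - \gamma(p) = -2E[A(A+B) \mid V = p]$, and the bound is the
Cauchy-Schwarz inequality applied to $E[A(A+B) \mid V = p]$.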

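2.4. Instrument with Limited Support. In the case of a scalar instrument and one unobservable,
if we consider the initial model (1.1)-(1.3) and assume that the coefficient on $Z_2$ is positive
(recall that $Z_1 = 1$), we obtain a selection equation of the form:

$$X = 1\{Z_2 > \tilde\Gamma\}$$

for some scalar unobservable $\tilde\Gamma$ that is independent of $Z_2$ but not necessarily uniformly
distributed. From here on we use the notation $\tilde\Gamma$ to denote a rescaled version of $\Gamma$,
and we denote by $\gamma$, without the tilde, the arguments of functions when this is not subject to
confusion. Assume that $\mathrm{supp}(Z_2) = I_Z$ is an interval and denote by $\mathrm{int}(I_Z)$ its
interior. Consider for example the MTE. Because $Z_2$ and $\tilde\Gamma$ are independent by Assumption
2.3, we have

$$\forall \gamma \in I_Z,\quad r_Y(\gamma) := E[Y \mid Z_2 = \gamma] = E[A] +
\int_{\mathrm{supp}(\tilde\Gamma)} E[B \mid \tilde\Gamma = \tilde\gamma]\,
f_{\tilde\Gamma}(\tilde\gamma)\,1\{\gamma > \tilde\gamma\}\,d\tilde\gamma.$$

This yields

(2.4)  $\forall \gamma \in \mathrm{int}(I_Z),\quad r_Y'(\gamma) = E[B \mid \tilde\Gamma = \gamma]\,
f_{\tilde\Gamma}(\gamma)\,1\{\gamma \in \mathrm{supp}(\tilde\Gamma)\}.$

Similarly, setting

$$\forall \gamma \in I_Z,\quad r_X(\gamma) := E[X \mid Z_2 = \gamma] =
\int_{\mathrm{supp}(\tilde\Gamma)} f_{\tilde\Gamma}(\tilde\gamma)\,1\{\gamma > \tilde\gamma\}\,d\tilde\gamma,$$

we get

(2.5)  $\forall \gamma \in \mathrm{int}(I_Z),\quad r_X'(\gamma) = f_{\tilde\Gamma}(\gamma)\,
1\{\gamma \in \mathrm{supp}(\tilde\Gamma)\}.$

We extend both derivatives $r_Y'$ and $r_X'$ to be 0 on the complement of $\mathrm{int}(I_Z)$. These
extensions of the derivatives are the natural extensions when $\mathrm{supp}(\tilde\Gamma) \subset I_Z$
since, for example, for any $z_2$ in $I_Z$ such that $z_2 > \gamma$ for every $\gamma$ in
$\mathrm{supp}(\tilde\Gamma)$, $E[Y \mid Z_2 = z_2] = E[A] + E[B]$, and when $z_2 \le \gamma$ for every
$\gamma$ in $\mathrm{supp}(\tilde\Gamma)$, $E[Y \mid Z_2 = z_2] = E[A]$, and the derivatives are 0. Note
that

$$\int_{-\infty}^{\infty} r_X'(\gamma)\,d\gamma = P(\tilde\Gamma \in I_Z).$$

This yields

$$\frac{\int_{-\infty}^{\infty} r_Y'(\gamma)\,d\gamma}{\int_{-\infty}^{\infty} r_X'(\gamma)\,d\gamma}
= E[B \mid \tilde\Gamma \in I_Z].$$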



It corresponds to the average treatment effect for the sub-population whose unobservables vary in
a range that can be apprehended by the variation of the instrument. Under the supplementary
assumption that $\mathrm{supp}(\tilde\Gamma) \subset I_Z$ this is the average treatment effect for the
whole population. Similarly, it is easy to revisit the derivations of Section 2.2 and get for example
that



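(2.6)  $f_{B|\tilde\Gamma \in I_Z}(b) = \dfrac{1}{2\pi}\displaystyle\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}
e^{-itb}\,\dfrac{\mathcal{F}_1[f_{A+B,\tilde\Gamma}](t,\gamma)}{\mathcal{F}_1[f_{A,\tilde\Gamma}](t,\gamma)}\,
f_{\tilde\Gamma|\tilde\Gamma \in I_Z}(\gamma)\,d\gamma\,dt$

where, $\forall \gamma \in \mathrm{int}(I_Z)$,

$$f_{\tilde\Gamma|\tilde\Gamma \in I_Z}(\gamma) = \frac{r_X'(\gamma)}{\int_{-\infty}^{\infty} r_X'(\tilde\gamma)\,d\tilde\gamma},$$

(2.7)  $\mathcal{F}_1[f_{A+B,\tilde\Gamma}](t,\gamma) = \partial_\gamma E[Xe^{itY} \mid Z_2 = \gamma],$

(2.8)  $\mathcal{F}_1[f_{A,\tilde\Gamma}](t,\gamma) = -\partial_\gamma E[(1 - X)e^{itY} \mid Z_2 = \gamma].$

Here $\mathcal{F}_1$ stands for the partial Fourier transform with respect to the first argument and we
use the notations $f_{A+B,\tilde\Gamma}$, $f_{A,\tilde\Gamma}$ and $f_{B,\tilde\Gamma}$ for the joint
densities of respectively $(A + B, \tilde\Gamma)$, $(A, \tilde\Gamma)$ and $(B, \tilde\Gamma)$. We will
also use the notation $\mathcal{F}_2$ to denote the partial Fourier transform with respect to the second
argument, which will be $\gamma$ in the sequel. Recall the definition of the partial Fourier transform:

$$\mathcal{F}_1[f_{A,\tilde\Gamma}](t,\gamma) = \int_{-\infty}^{\infty} e^{ita} f_{A,\tilde\Gamma}(a,\gamma)\,da,
\qquad
\mathcal{F}_2[f_{A,\tilde\Gamma}](x,w) = \int_{-\infty}^{\infty} e^{iw\gamma} f_{A,\tilde\Gamma}(x,\gamma)\,d\gamma.$$

We denote by $\mathcal{F}$ the usual Fourier transform, for example,

$$\mathcal{F}[f_{A,\tilde\Gamma}](t,w) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}
e^{i(ta + w\gamma)} f_{A,\tilde\Gamma}(a,\gamma)\,da\,d\gamma.$$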

The conditional density of the treatment effect given by (2.6) coincides with the density of the
treatment effect for the whole population if and only if $\mathrm{supp}(\tilde\Gamma) \subset I_Z$,
i.e., the instrument has enough variation to capture all the heterogeneity in the population. The
assumption $\mathrm{supp}(\tilde\Gamma) \subset I_Z$ could be easily tested as it is enough to check
that $\int_{I_Z} r_X'(\gamma)\,d\gamma = 1$.
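
As a rough numerical illustration of this support check, the Python sketch below uses the fact that the
integral of $r_X'$ over $I_Z$ equals $r_X(\sup I_Z) - r_X(\inf I_Z)$ and approximates the two endpoint
values by averages of X over thin windows at the edges of the support. The design (uniform instrument
on [-1, 1], standard normal unobservable), the window width and the helper name `integrated_slope` are
assumptions for illustration only; this is not a formal test and says nothing about sampling error.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)
n = 200_000
z2 = rng.uniform(-1.0, 1.0, size=n)   # instrument with support I_Z = [-1, 1]
gamma = rng.normal(size=n)            # rescaled unobservable; its support exceeds I_Z
x = (z2 > gamma).astype(float)        # selection equation X = 1{Z2 > Gamma}

def integrated_slope(x, z2, lo, hi, width=0.05):
    """Crude estimate of the integral of r_X' over [lo, hi] as r_X(hi) - r_X(lo)."""
    top = x[z2 > hi - width].mean()       # approximates r_X at the upper endpoint
    bottom = x[z2 < lo + width].mean()    # approximates r_X at the lower endpoint
    return top - bottom

mass_hat = integrated_slope(x, z2, -1.0, 1.0)
mass_true = erf(1.0 / sqrt(2.0))      # P(-1 <= Gamma <= 1) for a standard normal
```

In this design the estimated mass is clearly below one, which signals that only the distribution of
effects for the subpopulation with unobservable in $I_Z$ is recoverable.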
2.5. Principles of Estimation. To focus on the central innovation in this paper, we consider now
the case of one instrument and limited support, as in the previous subsection (in the case of many 
instruments, we advocate the use of several unobservables, see below). 

The integral over t in (2.6) is the Fourier transform inversion formula. The inverse Fourier trans- 
form of a ratio of characteristic functions is a classical structure for an estimator in deconvolution 
problems where one observes A + B where A and B are independent and one knows the characteristic 
function of the error A. Optimal rates for deconvolution problems under various smoothness assump- 
tions on the densities of A and B are given in Fan (1991), Butucea (2004) and Butucea and Tsybakov 
(2007). 
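
For intuition, the classical (unconditional) deconvolution estimator mentioned above can be sketched
in a few lines of Python. The example assumes that the error A is Laplace with known characteristic
function $1/(1+t^2)$, that B is Gaussian, and that high frequencies are simply truncated at a fixed
cutoff; the function name `deconvolve_density`, the cutoff and the grid sizes are illustrative
assumptions and are not tuned in any optimal way.

```python
import numpy as np

def deconvolve_density(s, cf_noise, b_grid, cutoff, n_freq=121):
    """Density of B on b_grid from draws of S = A + B, given the c.f. of the noise A."""
    t = np.linspace(-cutoff, cutoff, n_freq)
    dt = t[1] - t[0]
    ecf_s = np.exp(1j * np.outer(t, s)).mean(axis=1)   # empirical c.f. of A + B
    ratio = ecf_s / cf_noise(t)                        # estimated c.f. of B for |t| <= cutoff
    integrand = np.exp(-1j * np.outer(b_grid, t)) * ratio
    # Trapezoid rule over the frequency grid (Fourier inversion with spectral cutoff).
    integral = dt * (integrand.sum(axis=1) - 0.5 * (integrand[:, 0] + integrand[:, -1]))
    return integral.real / (2.0 * np.pi)

rng = np.random.default_rng(2)
n = 20_000
a = rng.laplace(0.0, 1.0, size=n)     # noise with known c.f. 1 / (1 + t^2)
b = rng.normal(1.0, 1.0, size=n)      # signal; only a + b is passed to the estimator
b_grid = np.array([1.0, 6.0])
f_hat = deconvolve_density(a + b, lambda t: 1.0 / (1.0 + t ** 2), b_grid, cutoff=3.0)
```

Dividing by the characteristic function of the noise amplifies sampling error at high frequencies,
which is why some form of truncation or kernel weighting in the frequency domain is needed.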

Here we are reasoning conditional on the unobservable $\tilde\Gamma$, because $A$ and $B$ are
independent given $\tilde\Gamma$ only. Conditional deconvolution also appears in Evdokimov (2010). In
this article the density of the second equation unobserved heterogeneity parameter $\tilde\Gamma$ acts
as a mixing distribution. So we are dealing with conditional deconvolution in a nonparametric mixture
context. The degree of ill-posedness should hence be related to the decay to zero of the partial
Fourier transform $\mathcal{F}_1[f_{A,\tilde\Gamma}](t,\gamma)$ in $t$ in a certain sense.

A second difference to standard deconvolution is that the distribution of $A$ is unknown but
estimable. This is a particular case of an inverse problem with an unknown but estimable operator, also
encountered in econometrics in the case of nonparametric instrumental variables (see, e.g., Carrasco,
Florens and Renault (2007)). This situation in a deconvolution problem has been studied in Diggle
and Hall (1993), Neumann (1997) and later by Johannes (2009) and Comte and Lacour (2009). They
obtain mathematical results when a preliminary sample allows one to estimate the characteristic
function of $A$ via the empirical characteristic function^5 at rate $1/\sqrt{N}$. In their setting
$A + B$ is observed on a second sample and the characteristic function of $A + B$ is again estimated
via the empirical characteristic function. Neumann (1997) and Comte and Lacour (2009) obtain lower
bounds, under various smoothness assumptions on the densities of $A$ and $B$, that account for the
extra difficulty of estimating the characteristic function of $A$. The estimator of Comte and Lacour
(2009), which is built on that of Neumann (1997), is adaptive.



^5 $\hat\phi_A(t) = \dfrac{1}{N}\displaystyle\sum_{i=1}^{N} \exp(ita_i)$.

A third difference is that we do not observe $A + B$ either, but we can estimate the partial Fourier
transform $\mathcal{F}_1[f_{A+B,\tilde\Gamma}](t,\gamma)$ by solving an inverse problem (see next
paragraph). To our knowledge, this case has never been studied before. In Evdokimov (2010) $A + B$ is
observed and the empirical characteristic function of $A + B$ in classical deconvolution problems is
replaced by a Nadaraya-Watson type estimator of the characteristic function of $A + B$ given
$\tilde\Gamma$. This is not possible here.

In this article we use (2.7) and (2.8) to estimate the partial Fourier transforms $\mathcal{F}_1[f_{A+B,\Gamma}](t,\gamma)$
and $\mathcal{F}_1[f_{A,\Gamma}](t,\gamma)$ that replace the characteristic functions in the classical deconvolution problem. The
mixing distribution $f_\Gamma$ is estimated by estimating the two quantities in the ratio in (2.6). Estimation
of the numerator is also the estimation of the derivative of a regression function. Estimation of
derivatives of regression functions is typically an ill-posed inverse problem. Note also that in the
case of degenerate design (points where the density is zero) the rates of estimation of regression
functions can be degraded as for inverse problems; see, for example, Gaiffas (2009) and the references
therein. Gaiffas (2009) proposes an adaptive procedure for the sup-norm using local polynomials and
a selection rule similar to Lepski's method. However, to our knowledge, there are no results for
the estimation of derivatives in the presence of random and possibly degenerate design or extensions
to an inverse problem setting (the case of Section 3). The case of degenerate random design is a
situation encountered in an inverse problem setting in both Hoderlein, Klemela and Mammen (2010)
and Gautier and Kitamura (2009), where regressors are unbounded and degeneracy occurs at "infinity"
for many designs of interest.
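
As a rough illustration of how a derivative of a regression function can be estimated with local polynomials, the following Python sketch fits a kernel-weighted polynomial around a point and reads the derivative off the linear coefficient. It is only a sketch under assumptions: the function name `local_poly_derivative`, the Epanechnikov kernel and the noise-free test function are illustrative choices and are not taken from this paper.

```python
import numpy as np

def local_poly_derivative(x, y, x0, bandwidth, degree=2):
    """Estimate the first derivative of E[y | x] at x0 by local polynomial regression.

    Hypothetical helper: fits y ~ sum_k c_k u^k with u = (x - x0) / bandwidth,
    using Epanechnikov kernel weights, and returns c_1 / bandwidth.
    """
    u = (np.asarray(x, dtype=float) - x0) / bandwidth
    # Epanechnikov weights, zero outside the window |u| <= 1
    weights = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)
    # Design matrix with columns 1, u, u^2, ..., u^degree
    design = np.vander(u, degree + 1, increasing=True)
    root_w = np.sqrt(weights)
    # Weighted least squares via ordinary least squares on rescaled rows
    coef, *_ = np.linalg.lstsq(design * root_w[:, None],
                               np.asarray(y, dtype=float) * root_w,
                               rcond=None)
    # d/dx at x0 equals the coefficient on u divided by the bandwidth
    return coef[1] / bandwidth

# Noise-free check on a quadratic, whose derivative at 0.5 is 2 * 0.5 + 3
x = np.linspace(0.0, 1.0, 201)
y = x**2 + 3.0 * x
slope = local_poly_derivative(x, y, x0=0.5, bandwidth=0.2)
```

For the complex-valued responses that appear in the partial Fourier transforms, one possibility is to apply such a routine separately to the real and imaginary parts.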

To summarize, our proposed estimator is computed in the following way: 

(1) Step 1: Estimates of $\mathcal{F}_1[f_{A+B,\Gamma}]$ and $\mathcal{F}_1[f_{A,\Gamma}]$ are obtained by estimating the derivatives of the two
regression functions. There are many estimators of derivatives of regression functions.^6



^6 For example, local polynomial estimators (see, e.g., Fan and Gijbels (1996) and Tsybakov (2009)) can be used. In
the case of (2.7) the estimator is obtained as follows



-^1 [/a+s.f] (*.7) = JJj^^^^^'^^^'^i^ii'y) where Wjv,(7) = eiB^l^U 



N / V 

U K 



U{u) ^ (^l,u,u'/2\,...,u'/V^ and Bn-, = j^Y.^ 



N \ n.M J \ riN 

4—1 



$K$ is a kernel, $h_N$ a bandwidth, $e_1$ a row vector of size $l + 1$ with all coordinates equal to 0 except the second one, which is equal to
1, and $l \ge 1$; in practice $l = 2$ is used for the estimation of one derivative.
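(2) Step 2: Compute the following integral (in practice this is carried out numerically)^7

(2.11) $$\hat f_{B|\Gamma}(b;\gamma) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-itb}\, K(t h_{N,\gamma})\, \frac{\hat{\mathcal{F}}_1[f_{A+B,\Gamma}](t,\gamma)}{\hat{\mathcal{F}}_1[f_{A,\Gamma}](t,\gamma)}\, 1\left\{\left|\hat{\mathcal{F}}_1[f_{A,\Gamma}](t,\gamma)\right| > t_{N,t,\gamma}\right\} dt$$

where N is the sample size, $h_{N,\gamma}$ is a bandwidth going to zero with N, K a kernel (a typical
kernel that we will use in Section 4 is $K(t) = 1\{|t| \le 1\}$, with $h_{N,\gamma}$ the inverse of a cutoff frequency; it amounts to
truncation of high frequencies), and $t_{N,t,\gamma}$ a proper trimming factor. It has the same structure as
that of Neumann (1997)^8. $h_{N,\gamma}$ depends on $\gamma$ because of the possibly different decay rates of
$\mathcal{F}_1[f_{A,\Gamma}](\cdot,\gamma)$ for different values of $\gamma$. When $\mathcal{F}_1[f_{A,\Gamma}]$ is known, K avoids dividing by small
values of $\mathcal{F}_1[f_{A,\Gamma}]$, and $\mathcal{F}_1[f_{A,\Gamma}]$ is small for large values of the frequency t.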

(3) Step 3: Rely on plug-in estimators of the numerators and denominators of the estimable mixing 
density ^^(7)/ "^^^ numerator could be obtained by estimating the derivative 
of the choice probability and the denominator by integrating it numerically.^ If we denote the 



corresponding estimator /p|p^^^, the global estimator is given by 



(2.12) 



Because the last integral with respect to $\gamma$ in (2.12) should in practice be carried out numerically
using a quadrature method, Steps 1 and 2 need only be carried out a finite number of times. Section 4
considers the asymptotic behavior of this estimator.
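
The deconvolution in Step 2 (a kernel that truncates high frequencies combined with a trimming rule on the denominator) can be sketched numerically as follows. This is an idealized illustration, not the estimator of the paper: the function name `deconvolve_density` is hypothetical, the conditioning on $\gamma$ is suppressed, and exact Gaussian characteristic functions stand in for the Step 1 estimates of the two partial Fourier transforms.

```python
import numpy as np

def deconvolve_density(b, t, cf_sum, cf_a, bandwidth, trim):
    """Spectral cutoff deconvolution with a trimmed denominator (hypothetical helper).

    b         : point at which the density of B is evaluated
    t         : increasing grid of frequencies
    cf_sum    : values on the grid of the (estimated) transform for A + B
    cf_a      : values on the grid of the (estimated) transform for A
    bandwidth : h, used with the kernel K(t) = 1{|t| <= 1} evaluated at t * h
    trim      : frequencies where |cf_a| <= trim are discarded
    """
    kernel = (np.abs(t * bandwidth) <= 1.0).astype(float)
    keep = np.abs(cf_a) > trim
    # Avoid dividing by tiny values: the ratio is set to zero where trimmed
    safe_den = np.where(keep, cf_a, 1.0)
    ratio = np.where(keep, cf_sum / safe_den, 0.0)
    integrand = np.exp(-1j * t * b) * kernel * ratio
    # Trapezoidal rule over the frequency grid
    integral = np.sum((integrand[1:] + integrand[:-1]) * np.diff(t)) / 2.0
    return integral.real / (2.0 * np.pi)

# Illustration: A ~ N(0, 0.5^2) and B ~ N(1, 1) independent (assumed, for the check only)
t = np.linspace(-12.0, 12.0, 4801)
cf_a = np.exp(-0.125 * t**2)
cf_b = np.exp(1j * t - 0.5 * t**2)
estimate = deconvolve_density(1.0, t, cf_a * cf_b, cf_a, bandwidth=1.0 / 8.0, trim=1e-4)
```

With estimated rather than exact transforms, the cutoff and the trimming level would have to balance the approximation bias against the estimation error in the ratio, which is the trade-off studied in Section 4.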



^7 Alternatively, the following two simpler estimators are possible



(2.9) 



(2.10) 



/B|r(^;7) 



1 
2^ 



/Bif(»;i) = 2; 



e (t hpf.-/] 



■^1 [fA,r\ 



[/A.r](i.7) 



> tM.t 



dt 



{t,j)dt. 



(2.10) is in the spirit of Diggle and Hall (1993) and the asymptotic analysis and tuning of the bandwidth is known to
be tricky; see, e.g., Neumann (1997).

^8 In Neumann (1997) and Comte and Lacour (2009), because the characteristic function of A is estimated at rate
$1/\sqrt{N}$, $t_{N,t,\gamma}$ could be taken equal to $N^{-1/2}$, independent of t and $\gamma$.

^ Under the assumption that supp C Iz the denominator is equal to 1 and we get the distribution of the treatment 
effect for the whole population. In that case the denominator does not need to be estimated. Recall that this could be 
tested. 
To obtain the marginal relevant treatment effect, simply replace $d\gamma$ by $\omega(\gamma,\theta)\,d\gamma$ above. The
asymptotic analysis in that case is the same as without the weight $\omega(\gamma,\theta)$ when the weight is bounded.



3. The Multivariate Unobservables Case 

In this section, we extend the previous framework to cover models with a higher dimension of 
unobservables at the expense of being linear in the instruments. We show how to obtain the MTE, 
which is now a function of several unobservables, and extend the approach to obtain the distribution 
of treatment effects, including a discussion of the respective assumptions. 

3.1. MTE and ATE in the Case of Multivariate Unobservables and Instruments in the 
Selection Equation. 

3.1.1. Model and Assumptions. Compared to the univariate unobservable in the selection equation, 
which results in a model that has more of the flavor of a reduced form model, we formalize our (more 
structural) model with several unobservables and as many instruments as follows: 

(3.1) $$Y = A + BX, \qquad X = 1\{D^\top Z_2 + V Z_1 < Z_3\}$$

where $Z_1 = 1$, and D is now a vector of unobservables of dimension $L - 2$. As discussed previously
and in Gautier and Kitamura (2009), imposing that one coefficient in the selection equation has a
sign is sufficient for identification of the distribution of the (scaled) random coefficients vector in this
equation. To account for scale invariance, we divide the latent equation in (1.2) by this coefficient. We
get the second equation of (3.1) when the coefficient of $Z_3$ is negative; otherwise replace $Z_3$ by $-Z_3$.
We use the notation $\Gamma = (D^\top, V)^\top$ for our random vector of scaled first stage (selection equation,
abbreviated FS) unobservables. Observe that this only makes a difference if $L = \dim(Z) > 2$, or
else we are generally back in the previous model. We introduce the notation $S = (Z_2, 1)/\|(Z_2, 1)\|$
and $U = Z_3/\|(Z_2, 1)\|$, so that the FS becomes $X = 1\{\Gamma^\top S < U\}$. Note that the support of S is
necessarily included in a hemisphere H, while U is a scalar which can be positive or negative.

It is possible to do everything below working with a vector $\Gamma$ of norm 1 in (1.2), as
in Gautier and Kitamura (2009). We choose to present this different scaling in order to present a
new approach to problems involving a random coefficients binary choice. This alternative
approach shares many similarities with HV. With this normalization the natural operator is no longer
the hemispherical transform but the Radon transform (see, e.g., Helgason (1999)).




Theorem 3.1 below involves the Radon transform, which is defined for $f \in L^1(\mathbb{R}^{L-1})$, s in a
hemisphere of the Euclidean space $\mathbb{R}^{L-1}$ and u in $\mathbb{R}$ through

$$R[f](s,u) = \int_{P_{s,u}} f(\gamma)\, dP_{s,u}(\gamma)$$

where $P_{s,u} = \{\gamma : \gamma^\top s = u\}$ is an affine hyperplane of dimension $L - 2$ in $\mathbb{R}^{L-1}$ and $dP_{s,u}$ is the
Lebesgue measure on $P_{s,u}$. Mathematical results regarding this integral transformation (among which
its injectivity, an inversion formula involving the adjoint of the Radon transform and the projection
theorem) can be found in Natterer (1996) and Helgason (1999). Statistical inverse problems involving
this operator on the whole space appear in several problems from tomography (see, for example,
Korostelev and Tsybakov (1993) and Cavalier (2000)) but also when one wishes to estimate the distribution
of random coefficients in the linear model with random coefficients (see Hoderlein, Klemela and
Mammen (2010)).
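
To make the definition concrete, the following Python sketch evaluates the Radon transform numerically in the plane (the case $L - 1 = 2$), where the hyperplane $\{\gamma : \gamma^\top s = u\}$ is a line. The helper names `radon_2d` and `std_normal_2d` are hypothetical, and the standard bivariate normal density is used only because its line integrals are known in closed form: they equal the univariate standard normal density at u.

```python
import numpy as np

def radon_2d(density, s, u, r_max=10.0, n=4001):
    """Integrate `density` over the line {gamma : gamma's = u} in the plane.

    The line is parametrized as gamma = u * s + r * s_perp for r in [-r_max, r_max],
    and the integral is approximated by the trapezoidal rule.
    """
    s = np.asarray(s, dtype=float)
    s = s / np.linalg.norm(s)             # direction of unit norm
    s_perp = np.array([-s[1], s[0]])      # unit vector orthogonal to s
    r = np.linspace(-r_max, r_max, n)
    points = u * s[None, :] + r[:, None] * s_perp[None, :]
    values = density(points)
    return np.sum((values[1:] + values[:-1]) * np.diff(r)) / 2.0

def std_normal_2d(points):
    """Standard bivariate normal density evaluated at the rows of `points`."""
    return np.exp(-0.5 * np.sum(points**2, axis=1)) / (2.0 * np.pi)

s = np.array([np.cos(0.3), np.sin(0.3)])
line_integral = radon_2d(std_normal_2d, s, 0.7)
closed_form = np.exp(-0.5 * 0.7**2) / np.sqrt(2.0 * np.pi)
```

In this example `line_integral` and `closed_form` should agree up to quadrature error, whatever the direction s.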

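Assumption 3.1. Let $(\Omega, \mathcal{F}, P)$ be a complete probability space on which are defined the random
vectors $(Y, X, S, U, A, B, \Gamma) : \Omega \to \mathcal{Y} \times \mathcal{X} \times \mathcal{S} \times \mathcal{U} \times \mathcal{A} \times \mathcal{B} \times \mathcal{G}$, $\mathcal{Y} \subset \mathbb{R}$, $\mathcal{X} = \{0,1\}$, $\mathcal{S} \subset \mathbb{S}^{L-2}$, $\mathcal{U} \subset \mathbb{R}$,
$\mathcal{A} \subset \mathbb{R}$, $\mathcal{B} \subset \mathbb{R}$, $\mathcal{G} \subset \mathbb{R}^{L-1}$, where L is an integer. The causal model is defined by equations (3.1)
where the realizations of $(Y, X, S, U)$ are observable whereas those of $(A, B, \Gamma)$ are not.

Assumption 3.2. (i) $(A, B, \Gamma)$ are independent of Z. (ii) The distribution of $A = Y_0$ and the
distribution of $B = Y_1 - Y_0$ given $\Gamma = \gamma$ have moments of order one and $E[B \mid \Gamma = \cdot] f_\Gamma(\cdot) \in L^1(\mathbb{R}^{L-1})$.
(iii) The conditional density of B given $\Gamma = \gamma$ is in $L^1(\mathbb{R}) \cap L^2(\mathbb{R})$ for almost every $\gamma$ and
$E[e^{itA} \mid \Gamma = \cdot] f_\Gamma(\cdot)$ and $E[e^{it(A+B)} \mid \Gamma = \cdot] f_\Gamma(\cdot)$ are in $L^1(\mathbb{R}^{L-1})$.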

Assumption 3.3. There exists a ball B centered at 0 such that $\mathrm{supp}(\Gamma) \subset B \subset \mathrm{supp}(SU)$.

This assumption allows us to deal with instruments with limited support as long as the unobserved
heterogeneity varies within a limited range.

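Theorem 3.1. Make assumptions 3.1, 3.2 (i) and (ii) and 3.3. Then, defining the arguments of
$R^{-1}$ below as 0 outside the support of $(S, U)$, the following formula holds

$$E[B \mid \Gamma = \gamma] = \frac{R^{-1}\left[\frac{\partial}{\partial u} E[Y \mid (S,U) = (\cdot)]\right](\gamma)}{R^{-1}\left[\frac{\partial}{\partial u} E[X \mid (S,U) = (\cdot)]\right](\gamma)}$$

where $R^{-1}$ is the inverse of the Radon transform (see, e.g., page 15 of Helgason (1999)).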
Remark 3.1. Suppose that in the case of Section 2, instead of regressing on P, we regress on Z ($Z_2$,
since $Z_1 = 1$); then we would obtain that

3.2. The Distribution of Treatment Effects in the Multivariate Unobservables and In- 
struments Case. 

3.2.1. Parameter of Interest. In exactly the same setup as defined above in equation (3.1), we will be
concerned with the distribution of treatment effects for the same subpopulation as the one considered
by HV. More formally, we are interested in recovering $f_{B|\Gamma}(b;\gamma)$
for any $b \in \mathrm{supp}(Y_1 - Y_0)$ and any $\gamma \in \mathrm{supp}(\Gamma)$. To emphasize the parallels to above, we call this the
"Distribution of Treatment Effects at the Margin", and abbreviate it DITEM. The interpretation is
also quite similar to HV: for the subpopulation defined by $\Gamma = \gamma$, it provides us with a measure of
the effect of treatment. However, this measure is now the distribution of effects. As mentioned in the
introduction, the distribution of effects is different from the effect of treatment on the distribution.
We will study this object under an additional identifying assumption:

Assumption 3.4. $A \perp B \mid \Gamma$

Recall that if $A = a(U_0, U_2)$ and $B = b(U_1, U_2)$, a sufficient condition for Assumption 3.4 is
that $\Gamma = U_2$ and $U_0 \perp U_1 \mid \Gamma$. In words, there is a common driving factor that causes endogeneity
in the selection model, and it is given by $\Gamma$. In contrast to before, this is now an entire vector of
unobservables and it is more realistic that this vector accounts for a potentially complicated structure
of heterogeneity and correlation. Recall again that it does not mean that $A \perp B$. In fact, unless there
is no endogenous selection, in general there will be dependence between $Y_0$ and $Y_1 - Y_0$. In other
words, there is endogenous selection into treatment, but as far as it is endogenous, it is captured by
the vector $\Gamma$.

To recover the conditional characteristic function (ccf) of B at $\Gamma = \gamma$, we require the following condition:

Assumption 3.5. Extending as zero outside the support of $(S, U)$ the conditional expectation in the
argument of $R^{-1}$ below,

$$\forall (t, \gamma) \in \mathbb{R} \times \mathrm{supp}(\Gamma), \quad R^{-1}\left[\frac{\partial}{\partial u} E[e^{itY}(1 - X) \mid (S,U) = (\cdot)]\right](\gamma) \neq 0.$$

This assumption is the analogue of the technical assumption 2.5 and is the classical assumption
in deconvolution problems. Indeed we will see in the proofs that this corresponds to $E[e^{itA} \mid \Gamma = \gamma]\, f_\Gamma(\gamma) \neq 0$.



3.2.2. Main Result. These assumptions allow us to characterize the DITEM. 

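Theorem 3.2. Let assumptions 3.1, 3.2 (i) and (iii), 3.3-3.5 be true. Defining the arguments of
$R^{-1}$ below to be zero outside the support of $(S, U)$, the following formulas hold

(3.2) $$f_{B|\Gamma}(b;\gamma) = -\frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-itb}\, \frac{R^{-1}\left[\frac{\partial}{\partial u} E[e^{itY} X \mid (S,U) = (\cdot)]\right](\gamma)}{R^{-1}\left[\frac{\partial}{\partial u} E[e^{itY}(1 - X) \mid (S,U) = (\cdot)]\right](\gamma)}\, dt$$

(3.3) $$f_\Gamma(\gamma) = R^{-1}\left[\frac{\partial}{\partial u} E[X \mid (S,U) = (\cdot)]\right](\gamma).$$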

Note the parallel to above: if we do not take the conditioning and derivative wrt p, but wrt z, 
we obtain: 

Mv(''M^)) = 2ilj -8..^.ftE[(l-Xy"-|zL,| ''«- 
3.3. Estimation. The overall estimator is the same as in Section 2.5 and just requires an adaptation
of Step 1. We thus only present two possible ways to estimate the inverse Radon transform of the derivatives of regression
functions which appear in the numerator and denominator of (3.2) and in (3.3). We simply present
the estimation of $f_\Gamma$; the estimation of the numerator and denominator of (3.2) is the same.
A classical regularized inverse of the Radon transform is given by:

(3.4) $$R_R^{-1}[f](\gamma) = \int_H \int_{-\infty}^{\infty} K_R(s^\top \gamma - u)\, f(s,u)\, du\, d\sigma(s)$$

where R is a smoothing parameter and

(3.5) $$\forall u \in \mathbb{R}, \quad K_R(u) = 2(2\pi)^{-(L-1)} \int_0^R \cos(tu)\, t^{L-2}\, dt.$$

This suggests using as an estimator of the inverse Radon transform of a derivative of a regression
function

$$\hat f_\Gamma(\gamma) = R_R^{-1}\left[\widehat{\frac{\partial}{\partial u} E}[X \mid (S,U) = (\cdot)]\right](\gamma)$$

in the case of the estimation of $f_\Gamma$ (for example), where $\widehat{\frac{\partial}{\partial u} E}[X \mid (S,U) = (\cdot)]$ is an estimator of the
derivative of the regression function and R is chosen adequately.
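
As a numerical sanity check of a regularized inverse of the form (3.4)-(3.5), the following Python sketch treats the planar case $L = 3$, where the integral defining the kernel has a closed form. The assumptions are loud ones: the helper names are hypothetical, the hemisphere is taken to be the half circle of angles in $(0, \pi)$, and exact Radon data of a standard bivariate normal density are supplied in place of an estimated derivative of a regression function.

```python
import numpy as np

def kernel_KR(u, R):
    """K_R(u) = 2 (2 pi)^{-2} * int_0^R cos(t u) t dt, i.e. (3.5) with L = 3."""
    u = np.asarray(u, dtype=float)
    small = np.abs(u) < 1e-4
    safe_u = np.where(small, 1.0, u)      # placeholder value avoids division by zero
    closed_form = (np.cos(R * safe_u) + R * safe_u * np.sin(R * safe_u) - 1.0) / safe_u**2
    # The limit of the closed form as u -> 0 is R^2 / 2
    integral = np.where(small, 0.5 * R**2, closed_form)
    return 2.0 * (2.0 * np.pi) ** (-2) * integral

def regularized_inverse(radon_data, gamma, R, n_angles=90, u_max=8.0, n_u=3201):
    """Approximate the double integral in (3.4) over a half circle and over u."""
    gamma = np.asarray(gamma, dtype=float)
    angles = (np.arange(n_angles) + 0.5) * np.pi / n_angles   # midpoint rule in the angle
    u = np.linspace(-u_max, u_max, n_u)
    total = 0.0
    for theta in angles:
        s = np.array([np.cos(theta), np.sin(theta)])
        integrand = kernel_KR(s @ gamma - u, R) * radon_data(s, u)
        total += np.sum((integrand[1:] + integrand[:-1]) * np.diff(u)) / 2.0
    return total * np.pi / n_angles

def radon_std_normal(s, u):
    """Radon transform of the standard bivariate normal density (same for every s)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

value_at_origin = regularized_inverse(radon_std_normal, [0.0, 0.0], R=4.0)
```

In this example `value_at_origin` should be close to the density at the origin, $1/(2\pi)$; the remaining gap shrinks as the smoothing parameter R grows, which illustrates the bias side of the choice of R.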

The following second estimator is easy to compute but only works in specific situations. We 
introduce 

Vu G M, Kr{u) = 2(27r)"(^^i) / sm{tu)t^~'^dt. 

Jo 
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it is such that K'j^{u) = Kji{u) where Kji is defined by (3.5). Note that hm|„|^j^ K'j^{u) = hm|„j^oo 
by the Riemann-Lebesgue Theorem. We make the following assumption

Assumption 3.6. (i) Vs € H, supp (^fjj^g{-; s)^ = M. (ii) For almost every s in H, u ^ 
E[X|(5,C/) = {s,u)]kR{u) andu^ du^[X\{S,U) = {s,u)]Kr{u) are in L1(M) and u ^ ¥.[X\{S,U) = 
(s,ii)] is continuous and \\m.\^\^^^'K[X\{S ,U) = {s,u)]Kr{u) = 0. 

This assumption allows us to justify an integration by parts argument for the regularized inverse
counterparts of (3.3).



Proposition 3.1. Under Assumption 3.6, 

The trimmed sample counterpart estimator is given by 

(3.6) /f(7) = T7 E ^^=-^^ —l{ f^s,u)isi,ui) >tU 

i=i f(s,u){si,Ui) 

where $\hat f_{(S,U)}$ is a plug-in estimator of $f_{(S,U)}$. We classically introduce trimming to avoid dividing
by denominators too close to 0^10. $t_N$ and $R$ are trimming and smoothing parameters that should be
adequately chosen.

For the estimation of the partial Fourier transforms in the numerator and denominator of (3.2),
simply replace $x_i$ above by respectively $e^{ity_i} x_i$ and $-e^{ity_i}(1 - x_i)$. Because of possibly different
smoothness, the trimming and smoothing parameters should be adjusted as well for both the numerator and
denominator.

Note that estimators already^11 exist if we normalize $\Gamma$ to be of norm 1, using Gautier and
Kitamura (2009) and Gautier and Le Pennec (2011).



du 



E[X\iS,U) = i-)] 



(7) = E 



KnjS^j - U)X 
fis,u){S,U) 



^10 The trimming could be suppressed to obtain convergence in probability results when $f_{(S,U)}$ is bounded from below.
This is not compatible with Assumption 3.6 (i). Note the similarity between the introduction of trimming here and in
the deconvolution problem (via the kernel and trimming factor that accounts for estimation error on the denominator).
When $f_{(S,U)}$ is not bounded from below the rates are degraded.

^11 There is an easy analogue for the estimation of the partial Fourier transforms.
4. Asymptotic Analysis

We denote by $\|f\|_p = \left(\int |f(b)|^p\, db\right)^{1/p}$ for $p \in [1, \infty)$ the classical norms and by $\|f\|_\infty$ the
essential supremum norm, also called sup-norm for simplicity. We consider in this paper an upper
bound on the squared risk for simplicity. C is a constant whose value can change from line to line.

4.1. Estimation of $f_B$. We will start with a proposition that relates the estimation of $f_B$ with the
estimation of $f_\Gamma$ and of $f_{B|\Gamma}$. Take $w : \mathrm{supp}(\Gamma) \to \mathbb{R}$ a weight function. We will make the following
assumptions.

Assumption 4.1. /p € L°° ^supp (t] 
Assumption 4.2. w /^|p G x supp ^Fj 

Proposition 4.1. Let assumptions 3.1, 3.2 (i) and (iii), 3.3-3.5, 4.1 and 4.2 be true, then 



(4.1) 



< 3 



r -Ty 

1||2 



/i?ir(-;*) - /B|f(-;*)) 



ff-ff)w ^ /bip(-;*)w^W 



We now consider convergence in probability in order to easily handle the various plug-in terms,
especially $\hat f_{(S,U)}$ in the second estimator in Section 3.3, and multiplications.

4.2. Estimation of $f_{B|\Gamma}$. In order to work with smoothing and trimming factors in (2.11) that are
independent of t and $\gamma$, we work with sup-norm consistency of the estimators of the partial Fourier
transforms.



Assumption 4.3. 



sup 

teR, -yesuppir) 



^1 



' A+B,T 



f 



A+B,r 



sup 

tes, ■yesupp(f) 



^AT 



Op {rA+B,N) 

Op {rA,N) 



Unlike deconvolution situations with noise observed on a preliminary sample, each rate is
nonparametric and is the rate of estimation in an inverse problem. As mentioned in Section 3.1.1,
we have used a scaling giving rise to the Radon transform to present an approach to the estimation
of $f_\Gamma$ different from that of Gautier and Kitamura (2009). However, we could rescale $\Gamma$ to be on the
sphere and invert the hemispherical transform to obtain $f_\Gamma$ and the partial Fourier transforms in
(2.6); this is done in Section 5.1. The minimax rates of estimation as well as an adaptive estimator
are given in Gautier and Le Pennec (2011) in the case of an estimator of $f_\Gamma$. In the proposition
below we use the notation $\phi_{B|\Gamma}(t;\gamma)$ for the ccf of B given $\Gamma$ for $\Gamma = \gamma$ evaluated at t.

Proposition 4.2. Let assumptions 3.1, 3.2 (i) and (iii), 3.3, 3.5 and 4.3 hold. Take $t_{N,t,\gamma} = r_{A,N}$.
The following upper bound holds

(4.2) 



Or. 



7 - r 

, J suppiT) J —I. 



sMpp(r) 

K{t hN,-yf 



/a? (*'7) 



il-K{t hN,^)f </'s|f(i;7) 
2 [^A+B,N + '/'B|r(*;7) 



\ 



w'^{'j)dtdj 



J 



Besides the integral in $\gamma$, the upper bound is the same as in Comte and Lacour (2009), where
$r_{A+B,N}$ and $r_{A,N}$ are respectively $1/\sqrt{N}$ and $1/\sqrt{M}$, with M the sample size of the preliminary
sample used to estimate the characteristic function of A. Here, these parametric rates are replaced
by nonparametric rates for ill-posed inverse problems. The first term in the upper bound is the square of
the approximation bias.

Because $|\phi_{B|\Gamma}(t;\gamma)| \le 1$, we obtain as a corollary of (4.2)



(4.3) lf^^~(.;^)-f^^~(.-^) w{*) 



Or. 



suppir) J —00 



{l-K{t hN,y)y 0Bir(t;7) 



+- 



K{t hN,jf 



fA,f (*'7) 



{'''a+b,n + '''Xn) 



When $\mathrm{supp}(\Gamma)$ is bounded, we can take $w = 1$ above. Another sensible choice is to take $w = (f_\Gamma)^a$ for
some $a \in (0, 1]$ ensuring integrability. Note that when $a = 1$, the following term $\phi_{B|\Gamma}(t;\gamma) f_\Gamma(\gamma) = \mathcal{F}_1[f_{B,\Gamma}](t,\gamma)$ appears.

More precise rates could be obtained by making smoothness assumptions implying specific rates
$r_{A+B,N}$ and $r_{A,N}$, as well as a smoothness assumption on $f_{B|\Gamma}$ and an assumption on the decay rate
to zero of $\mathcal{F}_1[f_{A,\Gamma}](t;\gamma)$. An adaptation of classical ellipsoids for $f_{B|\Gamma}$ is

•A.i^r,a,w{L) = |f conditional density on M given 7 G M^"^ : 

\F[m;^)\\l + t'fe^^{2a\tY)dtw\^)d^<LA 



I -/ 

J supp{r) J —00 






where $r \ge 0$, $a > 0$, $\delta \in \mathbb{R}$ and $\delta > 1/2$ if $r = 0$, $L > 0$. The case $r > 0$ corresponds to an extension
of the case of super smooth functions; otherwise the functions are extensions of ordinary smooth
functions (in the Sobolev class). The cases $w = 1$ and $w = f_\Gamma$ are the more natural ones.

When K{t) = l{\t\ < 1} and we take hj\i^j of the form 1/ R^, the square of the approximation 
bias can be bounded in the following way 

/ / 0B|r(*; 7)^^(7) dtd^ < ({Rfjf + iy exp (-2a {R§Y) . 

Jsupp{v) J-oo ' \ y \ y 

The assumption on the decay rate of $\mathcal{F}_1[f_{A,\Gamma}](t;\gamma)$ strengthens Assumption 3.5.

Assumption 4.4. There exist $s(\gamma) \ge 0$, $b(\gamma) > 0$, $\eta(\gamma) \in \mathbb{R}$ ($\eta(\gamma) > 0$ if $s(\gamma) = 0$) and $k_0(\gamma), k_1(\gamma) > 0$
such that

(1) [(i)] 
(2) 



A:o(7)(l + t2)-^(7)/2 (■_5(^)|i|«(7) j < 

(3) or 



{t, 7)| < A:i(7)(l + exp 



A:o(7)(l + i')-''^^)/'exp(-6(7)|trW) < [</.^|f] (t; 7)] < A:i(7)(l + i')-''^^)/' exp (-6(7)!* 



5(7) 



Proposition 4.3. Let assumptions 3.1, 3.2 (i) and (iii), 3.3, 3.5 and 4.3 hold. Assume either (1):
$\mathrm{supp}(\Gamma)$ is bounded, $w = 1$, $f_{B|\Gamma}$ belongs to $\mathcal{A}_{\delta,r,a,1}(L)$ and Assumption 4.4 (2) holds with constants
independent of $\gamma$, or (2): $w = f_\Gamma$, $f_{B|\Gamma}$ belongs to $\mathcal{A}_{\delta,r,a,f_\Gamma}(L)$ and Assumption 4.4 (3) holds with constants
independent of $\gamma$. Take $t_{N,t,\gamma} = r_{A,N}$. The following upper bounds hold:

(1) [(i)] 

(2) if s = r = 0, then 
(4.4) 



2max{r;-5,0)+l 



(3) if $s > 0$ and $r = 0$,



(4.5) 



(4) if $s = 0$ and $r > 0$, then



^xmin(l+2r)-s,2(»7-5)) 



(4.6) 
(5) if $s > 0$ and $r > 0$, then
(4.7) 

where 

A «) = (^B)min(l+2,-.,2(,-5)) g2fe(R^)«^^ > ^| + ) max(2(,-5),0) ^2(fe-a)(il^)« = s, 6 > a} 

+ l{{r > s} U {r = s, 6 < a}}. 

5. Appendix 

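5.1. An alternative approach to the identification and estimation of the partial Fourier
transforms. In this section we assume that $P(\Gamma = 0) = 0$, we rescale $\Gamma$ in (1.2) to be of norm 1 and
denote by $S = Z/\|Z\|$. Consider for example the case of $\mathcal{F}_1[f_{A,\Gamma}]$.

$$\begin{aligned}
E[(1 - X)e^{itY} \mid S = s] &= E[(1 - X)e^{itA} \mid S = s] \\
&= E[e^{itA}] - E[1\{s^\top \Gamma > 0\} e^{itA}] \quad \text{(using (1.3))} \\
&= E[e^{itA}] - \int_{\mathbb{S}^{L-1}} 1\{s^\top \gamma > 0\}\, \mathcal{F}_1[f_{A,\Gamma}](t,\gamma)\, d\sigma(\gamma) \\
&= E[e^{itA}] - \mathcal{H}\left(\mathcal{F}_1[f_{A,\Gamma}](t,\cdot)\right)(s) \\
&= \tfrac{1}{2} E[e^{itA}] - \mathcal{H}\left(\left(\mathcal{F}_1[f_{A,\Gamma}](t,\cdot)\right)^-\right)(s) \qquad (5.1)
\end{aligned}$$

where $\sigma$ is the spherical measure on the sphere of the Euclidean space $\mathbb{R}^L$, $\mathcal{H}$ is the hemispherical
transform (see, e.g., Gautier and Kitamura (2009)) and $f^-$ is the odd part^12 of a function f.
If we assume full support of the regressors then the (i) of the following assumption holds.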

Assumption 5.1. (1) [(i)] 

(2) The rescaled vector of instruments S has a density with respect to $\sigma$ and its support is the
whole hemisphere $H^+ = \{s \in \mathbb{S}^{L-1} : s^\top (1, 0, \ldots, 0) > 0\}$.

(3) $\Gamma$ has a density $f_\Gamma$ with respect to $\sigma$ which is defined point-wise and has support included in
some hemisphere $H = \{s \in \mathbb{S}^{L-1} : s^\top n > 0\}$, where n is a vector of norm 1 that does not
need to be known.

^12 Odd, respectively even, functions are the closure in $L^2(\mathbb{S}^{L-1})$ of continuous functions such that $\forall s \in \mathbb{S}^{L-1}$, $f(-s) =
-f(s)$, respectively $\forall s \in \mathbb{S}^{L-1}$, $f(-s) = f(s)$. Each function in $L^2(\mathbb{S}^{L-1})$ is the sum of its odd and even part. We denote
by $L^2_{odd}(\mathbb{S}^{L-1})$ the subspace of $L^2(\mathbb{S}^{L-1})$ of odd functions.
Assumption 5.1 (3) is slightly weaker than the assumption we made in Section 3.1.1. Equation
(5.1) yields that, under Assumption 5.1 (3), $E[(1 - X)e^{itY} \mid S = s] - \frac{1}{2} E[e^{itA}]$ can be extended in a
unique way as an odd function defined on the whole $\mathbb{S}^{L-1}$ (it is initially only defined on $H^+$ according
to Assumption 5.1 (2)) through

$$\forall s \in H^+, \quad R_A(t,s) = E[(1 - X)e^{itY} \mid S = s] - \tfrac{1}{2} E[e^{itA}]$$
$$\forall s \in -H^+, \quad R_A(t,s) = -R_A(t,-s).$$

It is remarkable that $E[e^{itA}]$ is also identified in this model. This is due to the smoothing properties
of $\mathcal{H}$. Indeed (see, e.g., Gautier and Kitamura (2009)), because $R_A(t,\cdot)$ belongs to $\mathcal{H}(L^2_{odd}(\mathbb{S}^{L-1}))$, it
is continuous and odd. Thus for any point s on the boundary of $H^+$, $R_A(t,s) = -R_A(t,-s)$. This
yields

(5.2) $$\lim_{s' \to s,\ s' \in H^+} E[(1 - X)e^{itY} \mid S = s'] + \lim_{s' \to -s,\ s' \in H^+} E[(1 - X)e^{itY} \mid S = s'] = E[e^{itA}].$$

Because the right hand side does not depend on s, a more efficient estimation takes into account all
these relations for all s on the boundary of $H^+$. Given an estimator $\hat\phi_A(t)$ of $E[e^{itA}]$, we can get an
estimator of $\mathcal{F}_1[f_{A,\Gamma}]$ with the same formulas as in Gautier and Kitamura (2009) or Gautier and Le
Pennec (2011), replacing $2y_i - 1$ by $2(x_i - 1)e^{ity_i} + \hat\phi_A(t)$. In the case of the estimator of Gautier and
Kitamura (2009) (see the reference for more details) and delayed means smoothing kernels we get
(5.3) 

^-7^.. , ( 2 y x(2l.+ l,2T.)M2p+l,L) / 1 f (2(x.-l)e-'^-+0l(t))c.-iy,(.f7) \ 

I p=o M2p+l,L)C2p+i[l) yi^i ma.x Us[Si),mN] J 

where |S^~^| = y\lI2) surface measure of S-^~^, h{n,L) = ^'^J^^^j^_2y,(^n+L-2)' ' — ~ 2)/2, 

A(2p + 1,L) = (7-i)(L+i|^'^L+2p-i) ' x{n,T) = ipin/T) where V : [0,oo) [0,oo) is infinitely 
differentiable, nonincreasing, such that 'ip{x) = 1 if x € [0, 1], < ^{x) < 1 if x G [1, 2], ip{x) = if 
X >2, and C^(-) are the Gegenbauer polynomials^'^ T^r is the smoothing parameter, uin a trimming 
factor and fs an estimator of the density of S. 

5.2. Proofs. 



The Gegenbauer polynomials are given by 
where (a)o = 1 and for n in N \ {0}, (a)„ — a{a + 1) • ■ • (a + ?i — 1) = T{a + n) /T{a). 
5.2.1. Proof of Theorem 2.2. Consider the conditional expectations:

$$E[(1 - X)e^{itY} \mid P = p] = E[(1 - X)e^{itA}e^{itBX} \mid P = p].$$

Note that $(1 - X)e^{itA}e^{itBX} = (1 - X)e^{itA}$. Hence

$$\begin{aligned}
E[(1 - X)e^{itY} \mid P = p] &= E[(1 - X)e^{itA} \mid P = p] \\
&= E[(1 - X)E[e^{itA} \mid V, P] \mid P = p] \\
&= E[(1 - X)E[e^{itA} \mid V] \mid P = p] \quad \text{(from Assumption 2.3 (i))} \\
&= \int_p^1 E[e^{itA} \mid V = v]\, dv \quad \text{(because } V \mid Z \sim U(0,1)\text{)} \\
&= -\int_0^p E[e^{itA} \mid V = v]\, dv + \int_0^1 E[e^{itA} \mid V = v]\, dv.
\end{aligned}$$

Differentiating with respect to p produces

$$-\partial_p E[(1 - X)e^{itY} \mid P = p] = E[e^{itA} \mid V = p].$$

Similarly,

$$\begin{aligned}
E[Xe^{itY} \mid P = p] &= \int_0^p E[e^{it(A+B)} \mid V = v]\, dv \\
&= \int_0^p E[e^{itA}e^{itB} \mid V = v]\, dv \\
&= \int_0^p E[e^{itA} \mid V = v]\, E[e^{itB} \mid V = v]\, dv \quad \text{(from Assumption 2.4)}.
\end{aligned}$$

Differentiating with respect to p produces

$$\partial_p E[Xe^{itY} \mid P = p] = E[e^{itA} \mid V = p]\, E[e^{itB} \mid V = p].$$

As a consequence,

$$E[e^{itB} \mid V = p] = \frac{\partial_p E[Xe^{itY} \mid P = p]}{-\partial_p E[(1 - X)e^{itY} \mid P = p]}.$$

Q.E.D.
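
The ratio of derivatives obtained at the end of this proof can be checked numerically on a toy design. The sketch below rests on assumptions that are not part of the paper: given $V = v$, A is $N(v, 1)$ and B is $N(2v, 1)$, independently; V is uniform on $(0,1)$ and $X = 1\{V \le P\}$. The two conditional expectations are then computed by quadrature rather than estimated from data, and all function names are hypothetical.

```python
import numpy as np

def cf_normal(t, mean, sd=1.0):
    """Characteristic function of a N(mean, sd^2) variable at t."""
    return np.exp(1j * t * mean - 0.5 * (sd * t) ** 2)

def cf_A(t, v):
    return cf_normal(t, v)          # assumed law of A given V = v

def cf_B(t, v):
    return cf_normal(t, 2.0 * v)    # assumed law of B given V = v

def integrate(f, lo, hi, n=4001):
    """Trapezoidal rule for a (possibly complex-valued) function on [lo, hi]."""
    x = np.linspace(lo, hi, n)
    y = f(x)
    return np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0

def treated_moment(t, p):
    """E[X e^{itY} | P = p], using conditional independence of A and B given V."""
    return integrate(lambda v: cf_A(t, v) * cf_B(t, v), 0.0, p)

def untreated_moment(t, p):
    """E[(1 - X) e^{itY} | P = p], the integral of the ccf of A over (p, 1)."""
    return integrate(lambda v: cf_A(t, v), p, 1.0)

def d_dp(moment, t, p, h=1e-4):
    """Central finite difference in p."""
    return (moment(t, p + h) - moment(t, p - h)) / (2.0 * h)

def identified_cf_B(t, p):
    """Ratio of derivatives from the proof, which should equal the ccf of B at V = p."""
    return d_dp(treated_moment, t, p) / (-d_dp(untreated_moment, t, p))

gap = abs(identified_cf_B(0.7, 0.4) - cf_B(0.7, 0.4))
```

In this design `gap` should be of the order of the finite difference error, which is consistent with the identity derived above.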

5.2.2. Proof of Theorem 2.3 and Theorem 2.4. Like in the proof of Theorem 2.2 we can check that 
dpE [XY^\P = p]=E [B^\V = p]+2E [AB\V = p]+E [A^\V = p] 

and 

dpE [XY^\P = p]- dpE [(1 - X)Y^\P = p] 
= E [B^\V = p] + 2E [AB\V = p]+E [A^\V = p]-E [A^\V = p] 
$= E[B^2 \mid V = p] + 2E[AB \mid V = p]$.

Since 



(5.4) |E [AB\V = p]\<E [\AB\ \V = p] < [B'^\V = p]E [A^\V = p], 

we get 



{dpE [xy2|p = p]- dpE [(1 - = p]-E [b^\v = p] } 



2 



< E[A^\V = p 



4E [B'^\V = p] 

= dpE[{l- X)Y^\P = p] 

Theorem 2.1 yields that 

'Var{B\V = p) - dpE [XY^\P = p] + {dpE [Y\P = p]f + dpE [(1 - X)Y^\P = p] 

< AdpE [(1 - X)Y^\P = p] (Var{B\V = p) + {dpE [Y\P = p]f^ , 
thus 

(Var{B\V = p) - dpE [XY^\P = p] + {dpE [Y\P = p\f -dpE [(1 - X)Y^\P = p]^ 

< AdpE [(1 - X)Y^\P = p] dpE [XY'^\P = p] . 

The bounds obtained are sharp because the only inequality comes from (5.4) and there could be equality
in the inequality.

Q.E.D. 

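5.2.3. Proof of Theorem 3.1. Similar computations as before yield

$$\begin{aligned}
E[Y \mid (S,U) = (s,u)] &= E[A] + \int_{\mathbb{R}^{L-1}} E[B \mid \Gamma = \gamma]\, f_\Gamma(\gamma)\, 1\{\gamma^\top s < u\}\, d\gamma \\
&= E[A] + \int_{-\infty}^{u} \int_{P_{s,v}} E[B \mid \Gamma = \gamma]\, f_\Gamma(\gamma)\, dP_{s,v}(\gamma)\, dv,
\end{aligned}$$

where the last identity holds because the Lebesgue measure on $\mathbb{R}^{L-1}$ is the product of the Lebesgue
measure on $\mathbb{R}^{L-2}$ and on $\mathbb{R}$. As a consequence, for $(u, s)$ in $\mathrm{supp}((U, S))$,

$$\frac{\partial}{\partial u} E[Y \mid (S,U) = (s,u)] = R\left[E[B \mid \Gamma = \cdot]\, f_\Gamma(\cdot)\right](s,u)$$

and

$$\frac{\partial}{\partial u} E[X \mid (S,U) = (s,u)] = R\left[f_\Gamma(\cdot)\right](s,u).$$

The equations in integral form imply both that, on the support of $(S, U)$, the derivatives of the left
hand side regression functions exist and that they are images via the Radon transform of well defined
$L^1(\mathbb{R}^{L-1})$ functions. Assumption 3.3 implies that $R\left[E[B \mid \Gamma = \cdot]\, f_\Gamma(\cdot)\right](s,u)$ and $R\left[f_\Gamma(\cdot)\right](s,u)$ are
0 outside the support of $(S, U)$, so 0 is the natural extension for the left hand side expressions outside
the support of $(S, U)$. Based on these extensions, it is now possible to apply the inverse operator $R^{-1}$.
This yields that

$$R^{-1}\left[\frac{\partial}{\partial u} E[Y \mid (S,U) = (\cdot)]\right](\gamma) = E[B \mid \Gamma = \gamma]\, f_\Gamma(\gamma)$$

and

$$R^{-1}\left[\frac{\partial}{\partial u} E[X \mid (S,U) = (\cdot)]\right](\gamma) = f_\Gamma(\gamma).$$

Q.E.D.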


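5.2.4. Proof of Theorem 3.2. The proof is very similar to that of Theorem 2.2 and we first show that

$$\begin{aligned}
\frac{\partial}{\partial u} E[e^{itY} X \mid (S,U) = (s,u)] &= R\left[E[e^{it(A+B)} \mid \Gamma = \cdot]\, f_\Gamma(\cdot)\right](s,u) \\
&= R\left[E[e^{itA} \mid \Gamma = \cdot]\, E[e^{itB} \mid \Gamma = \cdot]\, f_\Gamma(\cdot)\right](s,u),
\end{aligned}$$

where the second equality follows from the conditional independence Assumption 3.4.
Moreover

$$\frac{\partial}{\partial u} E[e^{itY}(1 - X) \mid (S,U) = (s,u)] = -R\left[E[e^{itA} \mid \Gamma = \cdot]\, f_\Gamma(\cdot)\right](s,u).$$

Equation (3.3) has been proved in the proof of Theorem 3.1.
Q.E.D.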

5.2.5. Proof of Proposition 3.1. Consider the case of the numerator in (3.2). The result is based on 
the following computations. 



R 



-1 



d 



-E [e^*^X| (5, [/) = (•)] 



(7) 



KR{s^-f - u)E [e**^X|(5,C/) = (s,n)] duda{s) 



supp{S) 

knis^-f - n)E [(^'^X\{S,U) = {s.u)] ^^^^^^^P^duda{s) 

supp{{S,U)) /(S.C/) 



E 



supp{{S,U)) 



KRis' j -u)e'^^X 



f{s,u){s,u) 



iS,U) = is,u) 



f{S,U){s,u)duda{s) 



(5.5) 



E 



KR{S'j-U)e'''^X 



(using the law of iterated conditional expectations) 



f{s,u){S,U) 

where for the first equality we use the integration by parts formula and Assumption 3.6. 
Q.E.D. 
5.2.6. Proof of Proposition 4.1. The proposition is a direct consequence of the relations $\hat a \hat b - ab =
(\hat a - a)(\hat b - b) + a(\hat b - b) + (\hat a - a)b$, $(a + b + c)^2 \le 3(a^2 + b^2 + c^2)$ and the Hölder inequality.
Q.E.D.

5.2.7. Proof of Proposition 4.2. We introduce the notations

2 1"^ 



/sir (^5 7) 



i?(t,7) 



2tt 



K{t hN,^)e-'^'(t>j^^f{t--f)dt 



/at 



The following decomposition holds by means of the Plancherel identity: 



/B|r-/B|r)(-;7) ,<4 (/^|f-/g|p)(.;7) 



+ 



TT 



(i,7) 



-^1 



M+_B,r 



'A+-B,r 



+ 



TT 



+ - 

TT 



oo 

2 



/ 



dt 



' A+BT 



it,!) 



dt 



oo 



K{thN,^f J^l fA+Bf (*'7) \R(.t,j)\^dt. 



We conclude using Lemma 5.1 below and the fact that by conditional independence 



' A+B,r 



fsf (*'7) 



^AT 



Q.E.D. 



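Lemma 5.1 below is an adaptation of the lemma of Neumann (1997). Denote by

$$\psi(t,\gamma) = \min\left(\frac{1}{\left|\mathcal{F}_1[f_{A,\Gamma}](t,\gamma)\right|},\ \frac{r_{A,N}}{\left|\mathcal{F}_1[f_{A,\Gamma}](t,\gamma)\right|^2}\right).$$

Lemma 5.1.

$$\sup_{t \in \mathbb{R},\ \gamma \in \mathrm{supp}(\Gamma)} \left\{\psi(t,\gamma)^{-1} |R(t,\gamma)|\right\} = O_p(1).$$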
5.2.8. Proof of Lemma 5.1. We distinguish between two cases.



Case 1: Let t and 7 be such that 



/^p (t,7) < 2rA.N. Then, ^{t.-fY^ < 2 



fAV 



and 



it suffices to upper bound in probabiUty Ti f.f (^iT) 1-^(^)7)1- By definition of R{t,^) 



^ AV (*'7) l-^(*'7)l ^ 1 on the event 



fA (t,7) \m,^)\ < {rA,Nr^ 



^ AT 



< TA^N \ , while 



on the complementary event 



fAf 



> '"A.Af ( ■ This yields 



sup {^(t,7)"M^(i,7)l} =Op(l). 

(t-7): |.Fi[/^_p]{t,7)|<2rA,iv 



Case 2: Let now t and 7 be such that 



■f^AT 



(t, 7) > 2rA,7v. Then, i;{t, 7)-^ < 2 (r^.iv)"^ 



(i,7) 



and it suffices to upper bound in probability {rA,N) 
By definition of R{t,^), 



^1 



J^AT 



(t,7) \Rit,i)\. 



(rA, 



f AT 



< {rA,N) ^ ^\ fAV (*''>') 



/at 



+ 





f AT 


(t,7)--^i 




(i,7) 




-^1 




(i,7) 





f AT 



> rA,N 



Using 



Tx 



^AT 



< 



J^AT 



+ 









--^1 


fA,V 


)(i,7) 


J'x 




(t,7) 


Tx 


fA,V 


(t,7) 



we obtain 



< {rA.NY 



f AT 



(t,7) \m.i)\ 



Tx 
I 



Tx 



< rA.N 



+ {rA,N) ^ 
< {rA^Nr^ 



Tx 



f A.T 



(i,7)--^i 



f AT 



(i,7) 



Tx 



f AT 



{t.l)-^x 



f AT 



V 



:Fx 



fA,f (t^l) 



f AT 



> rA,N 



f AT 



:Fx 



^AT 



(i,7) 



< r 



A.N 
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{rA^NY 



f A,f 



^A,T 
f AT 



(t,7) 



^A,T 



(t,7) 



> rA,N 



From the definition of the upper bound on the rate n, the last term in the sum is, uniformly in t 



and 7 such that 



Moreover, because 



^1 



■f^AT 



(i,7) 



Iat (*'^) — '^^A,N, bounded in probability. 

/A,f] 

< rAM} < 1 



< 1 



< 2 



> 2rA,Ar, 

-^1 [fA,f 

-^1 /at 



•^A,r 

•^A,f 



(i,7) 
(i,7) 



> 



> 



■^A,r 

^ AX 



(i,7) 
(i,7) /2 







•^A,r 


--^1 


•/'a.f 


)(t,7) 




-^1 


•^A,r 


(t,7) 





which yields 
(rA,Jv)"^ -7^1 /a_p (t,7) 



fAV (*'7) 



< ?^A,Ar ^ < (?'A,Af) ^ 



thus the first term is also, uniformly in t and 7 such that 

probability. 

Q.E.D. 



•^A,r 



- -^1 [/a,? J J 7) ' 
> 2ryi,Ar, bounded in 



5.2.9. Proof of Proposition 4.3. The proposition follows from adapting the upper bounds in Comte
and Lacour (2009), (4.2) and the assumptions made.

Q.E.D. 